mean_poisson_deviance#

mean_poisson_deviance is a regression loss for non-negative targets (typically counts). It is the (weighted) average of the Poisson deviance:

  • best value: 0.0 (when y_pred == y_true)

  • lower is better

Goals

  • define the metric and its domain

  • derive the formula from the Poisson likelihood

  • implement it from scratch in NumPy (and validate against scikit-learn)

  • build intuition with plots

  • use it as an optimization objective for Poisson regression (GLM)

import os

import numpy as np
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import mean_poisson_deviance

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

rng = np.random.default_rng(42)
np.set_printoptions(precision=4, suppress=True)

When should you use it?#

Use Poisson deviance when:

  • y_true represents counts (\(0,1,2,\dots\)) or other non-negative quantities

  • the conditional variance grows with the mean (a Poisson-like setting)

  • you want a loss that matches the Poisson likelihood (GLM / Poisson regression)

Domain constraints

For every sample:

  • \(y_i \ge 0\)

  • \(\mu_i > 0\) (predicted mean)

A common way to ensure \(\mu_i>0\) is a log link:

\[ \eta_i = \beta_0 + x_i^\top\beta,\qquad \mu_i = \exp(\eta_i) > 0. \]
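A tiny illustration (arbitrary \(\eta\) values, chosen for this snippet): the exponential maps any real-valued linear predictor to a strictly positive mean.

eta = np.array([-10.0, -1.0, 0.0, 2.5])  # the linear predictor can be any real number
mu = np.exp(eta)                         # mu = exp(eta) is always > 0
print(np.all(mu > 0))                    # True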

Definition#

For a single observation \(y \ge 0\) and a prediction \(\mu > 0\), the Poisson unit deviance is:

\[ d(y, \mu) = 2 \left[ y \log\left(\frac{y}{\mu}\right) - (y - \mu) \right], \qquad \text{with the convention } 0\log(0/\mu)=0. \]

The mean Poisson deviance is the average over samples:

\[ \mathrm{MPD}(y, \mu) = \frac{1}{n}\sum_{i=1}^n d(y_i, \mu_i) \quad (\text{or a weighted average}). \]

Useful special case:

\[ d(0, \mu)=2\mu. \]

For \(y>0\), writing \(r = \mu/y\) (multiplicative error):

\[ d(y,\mu) = 2y\,[ -\log r - 1 + r ]. \]

Properties:

  • \(d(y,\mu) \ge 0\)

  • \(d(y,\mu)=0\) iff \(\mu=y\)
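A quick numerical check of the definition, the \(y=0\) special case, and the \(r\)-form (values chosen arbitrarily):

y, mu = 4.0, 2.5
d_direct = 2.0 * (y * np.log(y / mu) - (y - mu))
r = mu / y
d_ratio = 2.0 * y * (-np.log(r) - 1.0 + r)
print(d_direct, d_ratio)  # same value: the two forms are algebraically equal
print(2.0 * mu)           # d(0, mu) = 2*mu
print(2.0 * (y * np.log(y / y) - (y - y)))  # d(y, y) = 0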

Where does this formula come from?#

For a Poisson model:

\[ y \mid \mu \sim \mathrm{Poisson}(\mu), \qquad \mu>0, \]

the log-likelihood for one observation is:

\[ \log p(y\mid \mu) = y\log\mu - \mu - \log(y!). \]

The deviance is a log-likelihood ratio between the fitted model and a saturated model that predicts perfectly (\(\mu=y\)):

\[ d(y,\mu) = 2\Big(\log p(y\mid y) - \log p(y\mid \mu)\Big) = 2\left[y\log\left(\frac{y}{\mu}\right) - (y-\mu)\right]. \]

So minimizing mean Poisson deviance is equivalent to maximizing the Poisson likelihood (the saturated term depends only on \(y\)).
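This equivalence is easy to verify numerically with scipy.stats.poisson (used here only for the check; the \(\log(y!)\) terms cancel in the difference):

from scipy.stats import poisson

y_i, mu_i = 3, 2.0  # an integer count and a positive prediction
d_loglik = 2.0 * (poisson.logpmf(y_i, y_i) - poisson.logpmf(y_i, mu_i))
d_formula = 2.0 * (y_i * np.log(y_i / mu_i) - (y_i - mu_i))
print(d_loglik, d_formula)  # equal up to floating-point error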

KL view

The KL divergence between Poisson distributions satisfies:

\[ \mathrm{KL}\big(\mathrm{Poisson}(y)\ \|\ \mathrm{Poisson}(\mu)\big) = y\log\left(\frac{y}{\mu}\right) + \mu - y, \]

so:

\[ d(y,\mu) = 2\,\mathrm{KL}\big(\mathrm{Poisson}(y)\ \|\ \mathrm{Poisson}(\mu)\big). \]
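As it happens, scipy.special.kl_div computes exactly this pointwise function (\(x\log(x/y) - x + y\), with the convention \(\mathrm{kl\_div}(0, y) = y\)), so the identity can be checked directly:

from scipy.special import kl_div

y_chk = np.array([0.0, 1.0, 4.0])
mu_chk = np.array([0.7, 1.2, 3.5])
print(2.0 * kl_div(y_chk, mu_chk))  # the unit deviances, including d(0, mu) = 2*mu

With the formula in hand, here is a from-scratch NumPy implementation with input validation: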
def _to_1d_float(a, name: str) -> np.ndarray:
    a = np.asarray(a, dtype=float)
    a = a.reshape(-1)
    if not np.all(np.isfinite(a)):
        raise ValueError(f"{name} must contain only finite values")
    return a


def poisson_deviance_per_sample(y_true, y_pred) -> np.ndarray:
    # Per-sample Poisson deviance d(y, μ)
    y_true = _to_1d_float(y_true, "y_true")
    y_pred = _to_1d_float(y_pred, "y_pred")

    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    if np.any(y_true < 0):
        raise ValueError("y_true must be non-negative")
    if np.any(y_pred <= 0):
        raise ValueError("y_pred must be strictly positive")

    dev = np.empty_like(y_true, dtype=float)

    mask = y_true > 0
    dev[~mask] = 2.0 * y_pred[~mask]  # y=0 => d(0, μ) = 2μ

    dev[mask] = 2.0 * (
        y_true[mask] * np.log(y_true[mask] / y_pred[mask])
        - (y_true[mask] - y_pred[mask])
    )

    return dev


def mean_poisson_deviance_np(y_true, y_pred, sample_weight=None) -> float:
    dev = poisson_deviance_per_sample(y_true, y_pred)

    if sample_weight is None:
        return float(dev.mean())

    w = _to_1d_float(sample_weight, "sample_weight")
    if w.shape != dev.shape:
        raise ValueError("sample_weight must have the same shape as y_true")
    if np.any(w < 0):
        raise ValueError("sample_weight must be non-negative")

    w_sum = float(w.sum())
    if w_sum == 0.0:
        raise ValueError("sample_weight sum must be positive")

    return float(np.sum(w * dev) / w_sum)


def mean_poisson_deviance_grad_mu(y_true, y_pred, sample_weight=None) -> np.ndarray:
    # Gradient of the mean deviance wrt μ (elementwise), shape (n,)
    y_true = _to_1d_float(y_true, "y_true")
    y_pred = _to_1d_float(y_pred, "y_pred")

    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    if np.any(y_true < 0):
        raise ValueError("y_true must be non-negative")
    if np.any(y_pred <= 0):
        raise ValueError("y_pred must be strictly positive")

    # d(y, μ) = 2 [ y log(y/μ) - (y - μ) ]
    # ∂/∂μ d(y, μ) = 2 (1 - y/μ)
    grad = 2.0 * (1.0 - y_true / y_pred)

    if sample_weight is None:
        return grad / y_true.size

    w = _to_1d_float(sample_weight, "sample_weight")
    if w.shape != grad.shape:
        raise ValueError("sample_weight must have the same shape as y_true")
    if np.any(w < 0):
        raise ValueError("sample_weight must be non-negative")

    w_sum = float(w.sum())
    if w_sum == 0.0:
        raise ValueError("sample_weight sum must be positive")

    return (w * grad) / w_sum
# Compare against scikit-learn

y_true = rng.poisson(lam=3.0, size=200)
y_pred = rng.gamma(shape=2.0, scale=2.0, size=200)  # positive

print("sklearn:", mean_poisson_deviance(y_true, y_pred))
print("numpy  :", mean_poisson_deviance_np(y_true, y_pred))

w = rng.uniform(0.0, 1.0, size=y_true.size)
print("sklearn (w):", mean_poisson_deviance(y_true, y_pred, sample_weight=w))
print("numpy   (w):", mean_poisson_deviance_np(y_true, y_pred, sample_weight=w))

# Tiny gradient sanity check (finite differences)
y_true_s = np.array([0.0, 1.0, 4.0])
y_pred_s = np.array([0.7, 1.2, 3.5])

g_analytical = mean_poisson_deviance_grad_mu(y_true_s, y_pred_s)

eps = 1e-6
base = mean_poisson_deviance_np(y_true_s, y_pred_s)
g_numeric = np.zeros_like(y_pred_s)
for i in range(y_pred_s.size):
    y_pred_eps = y_pred_s.copy()
    y_pred_eps[i] += eps
    g_numeric[i] = (mean_poisson_deviance_np(y_true_s, y_pred_eps) - base) / eps

print("grad analytical:", g_analytical)
print("grad numeric   :", g_numeric)
sklearn: 2.606757123122003
numpy  : 2.606757123122003
sklearn (w): 2.4106910529758423
numpy   (w): 2.4106910529758423
grad analytical: [ 0.6667  0.1111 -0.0952]
grad numeric   : [ 0.6667  0.1111 -0.0952]

Intuition: how does the penalty behave?#

Key behaviors to look for in the plots below:

  • The minimum is at \(\mu=y\).

  • For \(y>0\), as \(\mu\to 0^+\), the loss diverges (\(\log(y/\mu)\to\infty\)).

  • For \(y=0\), the deviance becomes linear: \(d(0,\mu)=2\mu\).

  • In terms of multiplicative error \(r=\mu/y\) (for \(y>0\)), the shape is independent of \(y\):

\[ \frac{d(y,\mu)}{y} = 2\,[ -\log r - 1 + r ]. \]

So, at a fixed multiplicative error \(r\), the deviance scales exactly linearly with the count \(y\).

mu = np.logspace(-3, np.log10(50), 600)
ys = [0, 1, 5, 20]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        "Unit deviance d(y, μ) as a function of μ",
        "Shape vs multiplicative error r = μ / y (for y>0)",
    ),
)

# Left: d(y, μ) vs μ for several y
for y in ys:
    y_vec = np.full_like(mu, float(y))
    d = poisson_deviance_per_sample(y_vec, mu)
    fig.add_trace(go.Scatter(x=mu, y=d, mode="lines", name=f"y={y}"), row=1, col=1)

fig.update_xaxes(title_text="μ (prediction)", type="log", row=1, col=1)
fig.update_yaxes(title_text="d(y, μ)", row=1, col=1)

# Right: normalized deviance shape as a function of r
r = np.logspace(-2, 2, 600)
d_over_y = 2.0 * (-np.log(r) - 1.0 + r)  # d(y, y*r) / y, independent of y
fig.add_trace(go.Scatter(x=r, y=d_over_y, mode="lines", name="d/y (any y>0)"), row=1, col=2)

fig.update_xaxes(title_text="r", type="log", row=1, col=2)
fig.update_yaxes(title_text="d(y, y·r) / y", row=1, col=2)

fig.update_layout(title="Mean Poisson deviance: shape and scaling")
fig.show()

Pros, cons, and common pitfalls#

Pros

  • Proper loss for counts: corresponds to a Poisson likelihood / GLM.

  • Respects positivity: large penalties for predicting near-zero when events occur.

  • Multiplicative intuition: depends on the ratio \(r=\mu/y\) (for \(y>0\)), not just additive error.

  • Convenient for optimization: with a log link, the gradient is simple: \(\nabla_\beta L \propto X^\top(\mu-y)\); see the sketch after the practical tips below.

Cons / pitfalls

  • Requires \(y\ge 0\) and \(\mu>0\).

  • Can be dominated by large counts / outliers (since deviance scales with \(y\)).

  • Assumes a Poisson-like mean–variance relationship (\(\mathrm{Var}(y\mid x)\approx\mu\)). For over-dispersion, consider Negative Binomial or Tweedie models.

Practical tips

  • Enforce \(\mu>0\) with a log link (μ = exp(η)).

  • For rates with exposures \(E_i\) (time at risk, population, etc.), use an offset:

\[ \mu_i = E_i\,\exp(x_i^\top\beta) \quad\Leftrightarrow\quad \log\mu_i = \log E_i + x_i^\top\beta. \]
  • In from-scratch code, clip \(\eta\) before exp to avoid overflow.
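
Putting these tips together, here is a minimal sketch of Poisson regression fit by full-batch gradient descent: log link, exposure offset, and \(\eta\)-clipping. It reuses mean_poisson_deviance_np from above; all constants (the true coefficients, exposure range, learning rate, step count, clipping bound) are illustrative assumptions, not prescribed values.

# Minimal Poisson-GLM sketch: gradient descent on the mean deviance.
# Illustrative assumptions: beta_true, exposure range, lr, steps, clip bound.
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([0.5, -0.3])
exposure = rng.uniform(0.5, 2.0, size=n)          # E_i > 0 (e.g., time at risk)
mu_true = exposure * np.exp(0.2 + X @ beta_true)  # log mu = log E + eta
y = rng.poisson(mu_true)

Xd = np.column_stack([np.ones(n), X])  # prepend an intercept column
beta = np.zeros(p + 1)
lr = 0.05

for _ in range(2000):
    eta = np.log(exposure) + Xd @ beta
    mu = np.exp(np.clip(eta, -30.0, 30.0))  # clip eta before exp (overflow guard)
    beta -= lr * 2.0 * Xd.T @ (mu - y) / n  # grad_beta = (2/n) X^T (mu - y)

eta = np.log(exposure) + Xd @ beta
mu = np.exp(np.clip(eta, -30.0, 30.0))
print("fitted beta (intercept first):", beta)
print("true beta                    :", np.concatenate([[0.2], beta_true]))
print("train mean deviance          :", mean_poisson_deviance_np(y, mu))

Mini-batching this loop is exercise 2 below; sklearn.linear_model.PoissonRegressor fits the same log-link objective, with optional L2 regularization.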

Exercises#

  1. Add an exposure offset and simulate data where events are proportional to exposure.

  2. Implement mini-batch gradient descent and compare convergence.

  3. Create an over-dispersed dataset (e.g., Gamma–Poisson / Negative Binomial) and compare Poisson deviance against another deviance (e.g., Tweedie).

References#

  • scikit-learn: sklearn.metrics.mean_poisson_deviance

  • scikit-learn: sklearn.linear_model.PoissonRegressor

  • McCullagh & Nelder, Generalized Linear Models